52 research outputs found

    MaSiF: Machine learning guided auto-tuning of parallel skeletons

    Machine learning in compilers

    Tuning a compiler so that it produces optimised code is a difficult task because modern processors are complicated; they have a large number of components operating in parallel and each is sensitive to the behaviour of the others. Building analytical models on which optimisation heuristics can be based has become harder as processor complexity has increased, and this trend is bound to continue as the world moves towards further heterogeneous parallelism. Compiler writers need to spend months to get a heuristic right for any particular architecture, and these days compilers often support a wide range of disparate devices. Whenever a new processor comes out, even if derived from a previous one, the compiler’s heuristics will need to be retuned for it. This is typically too much effort, and so, in fact, most compilers are out of date. Machine learning has been shown to help: by running example programs, compiled in different ways, and observing how those ways affect program run-time, automatic machine learning tools can predict good settings with which to compile new, as yet unseen programs. The field is nascent, but has demonstrated significant results already and promises a day when compilers will be tuned for new hardware without the need for months of compiler experts’ time. Many hurdles still remain, however, and while experts no longer have to worry about the details of heuristic parameters, they must spend their time on the details of the machine learning process instead to get the full benefits of the approach. This thesis aims to remove some of the aspects of machine learning based compilers for which human experts are still required, paving the way for a completely automatic, retuning compiler. First, we tackle the most conspicuous area of human involvement: feature generation. In all previous machine learning works for compilers, the features, which describe the important aspects of each example to the machine learning tools, must be constructed by an expert. Should that expert choose features poorly, they will miss crucial information without which the machine learning algorithm can never excel. We show not only that we can automatically derive good features, but also that these features outperform those of human experts. We demonstrate our approach on loop unrolling, and find we do better than previous work, obtaining XXX% of the available performance, more than the XXX% of the previous state of the art. Next, we demonstrate a new method to efficiently capture the raw data needed for machine learning tasks. The iterative compilation on which machine learning in compilers depends is typically time consuming, often requiring months of compute time. The underlying processes are also noisy, so that most prior works fall into two categories: those which attempt to gather clean data by executing a large number of times, and those which ignore the statistical validity of their data to keep experiment times feasible. Our approach, on the other hand, guarantees clean data while adapting to the experiment at hand, needing an order of magnitude less work than prior techniques.

    COLAB: A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors

    Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme. Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budgets and dark silicon. However, software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single-program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads. This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25%, and of 5% to 15% on average, depending on the hardware setup.

    Parallel-Pattern Aware Compiler Optimisations: Challenges and Opportunities

    This report outlines our finding that existing compilers are not aware of parallel-pattern semantics and thus miss massive optimisation opportunities.

    Iterative Compilation on Mobile Devices

    The abundance of poorly optimized mobile applications, coupled with their increasing centrality in our digital lives, makes a framework for mobile app optimization an imperative. While tuning strategies for desktop and server applications have a long history, it is difficult to adapt them for use on mobile phones. Reference inputs which trigger behavior similar to a mobile application's typical behavior are hard to construct. For many classes of applications the very concept of typical behavior is nonexistent, with each user interacting with the application in very different ways. In contexts like this, optimization strategies need to evaluate their effectiveness against real user input, but doing so online runs the risk of user dissatisfaction when suboptimal optimizations are evaluated. In this paper we present an iterative compiler which employs a novel capture and replay technique in order to collect real user input and use it later to evaluate different transformations offline. The proposed mechanism identifies and stores only the set of memory pages needed to replay the most heavily used functions of the application. At idle periods, this minimal state is combined with different binaries of the application, each one built with different optimizations enabled. Replaying the targeted functions allows us to evaluate the effectiveness of each set of optimizations for the actual way the user interacts with the application. For the BEEBS benchmark suite, our approach was able to improve performance by up to 57%, while keeping the slowdown experienced by the user at 0.8% on average. By focusing only on heavily used functions, we are able to reduce storage space by between two and three orders of magnitude compared to typical capture and replay implementations. Comment: 8 pages, 8 figures

    Code Translation with Compiler Representations

    In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low-quality translation, which reduces the practicality of NMT and stresses the need for an approach that significantly increases its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java -> Rust pair with greedy decoding. With beam search, it increases the number of correct translations by 5.5% on average. We extend previous test sets for code translation by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as an intermediary pivot for translation. Comment: 9 pages